How to Fix the Data Catalog Crawler

Owned by: Tony Johnson

The data catalog crawler checks the integrity of the data catalog by verifying each file shortly after it is registered, and by periodically rechecking files to confirm that they still exist and have not been changed. In addition, the crawler understands some file formats (certain ROOT and FITS files) and can extract additional information from such files and store it as metadata in the data catalog.
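As an illustration of what the periodic recheck amounts to, here is a minimal sketch of the idea, not the crawler's actual implementation. It assumes a hypothetical file checksums.md5 holding "checksum path" lines exported from the catalog:

# Sketch only; the crawler's real logic is not this script.
# checksums.md5 is a hypothetical export of "checksum path" pairs.
while read sum path; do
    if [ ! -f "$path" ]; then
        echo "MISSING: $path"
    elif [ "$(md5sum "$path" | awk '{print $1}')" != "$sum" ]; then
        echo "CHANGED: $path"
    fi
done < checksums.md5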

Setup

There are currently two crawlers, one for the DEV database and one for the PROD database.

Mode  Host        Working directory    JMX Port
Prod  glastlnx21  ~glast/datacat/prod  8088
Dev   glastlnx07  ~glast/datacat/dev   8087

These data crawlers are normally started automatically by cron jobs running on the appropriate hosts. Each crawler writes log files into the work subdirectory of its working directory, in particular:

File                Purpose
work/datacat.out    Standard output/error from the running crawler. Abnormal error messages may be reported here.
work/datacat-N.log  Log of activity (N=0 is the most recent log file).
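Since the crawler is started from cron and logs under work/, the quickest manual health checks are to look at the cron entry and tail the logs. A sketch, assuming the PROD working directory from the table above (the grep pattern for the cron entry is a guess):

# As user glast on the crawler host, find the cron entry that starts it.
crontab -l | grep -i datacat
# Watch standard output/error for abnormal errors.
tail -f ~glast/datacat/prod/work/datacat.out
# The most recent activity log is datacat-0.log.
less ~glast/datacat/prod/work/datacat-0.log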

Checking if the data crawler is running

Currently, use http://glastlnx20.slac.stanford.edu:5080/ (soon to be replaced by Nagios). The crawler is monitored via JMX and will only report that its status is OK if it has actively checked for files within the last 90 seconds; if it has hung for any reason, its status should be reported as not OK. The crawler status can also be checked via the Data Catalog Admin page, and the crawler logs messages to the Data Catalog message log.
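If the monitoring page is unreachable, a rough manual check is possible from the command line. This is a sketch using standard tools, shown for the PROD crawler; the process name pattern is a guess, and the host and port come from the setup table above:

# Fetch the monitoring page; a failure suggests trouble.
curl -s -f http://glastlnx20.slac.stanford.edu:5080/ > /dev/null || echo "monitor page not responding"
# Confirm a crawler process is alive on the host ([d]atacat avoids matching grep itself).
ssh glast@glastlnx21 'ps -ef | grep [d]atacat'
# Confirm the JMX port is listening.
ssh glast@glastlnx21 'netstat -tln | grep 8088'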

Starting and Stopping the crawler

To start or stop the crawler you must log on to the appropriate machine (see the table above) as user glast. The examples below use the PROD working directory; substitute ~glast/datacat/dev for the DEV crawler.
In principle the crawler can be stopped using:

cd ~glast/datacat/prod/
./stop

and started using

cd ~glast/datacat/prod/
./start
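A restart is therefore a stop followed by a start in the same working directory. A minimal sketch of the sequence for the PROD instance (the sleep interval is an arbitrary guess, not a documented requirement):

cd ~glast/datacat/prod/
./stop
# Allow the crawler a few seconds to shut down before restarting.
sleep 10
./start
# Confirm it came back up by watching standard output/error.
tail -f work/datacat.out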